Matrix factorization routines on heterogeneous architectures
Abstract
In this work we consider a method for parallelizing matrix factorization algorithms on systems with Intel® Xeon Phi™ coprocessors. We provide performance results of matrix factorization routines that implement this approach and are available in the Intel® Math Kernel Library (Intel MKL) on the Intel® Xeon® processor line with Intel Xeon Phi coprocessors.

Summary

New heterogeneous systems consisting of a multicore CPU with coprocessors introduce new challenges to designing efficient parallel algorithms. Simultaneous use of all computational resources of such a system for solving one large problem requires uneven distribution of data and computations, which leads to more complex parallelization methods. In this work we present a new parallelization method for matrix factorization that efficiently utilizes all computational resources of a heterogeneous system consisting of a multicore CPU with coprocessors. We show how this method can be applied to parallelize the key linear algebra factorization algorithms (QR, LU, and Cholesky) on systems with Intel Xeon Phi coprocessors.

Our matrix factorization method is based on the panel factorization approach [5]. The panel factorization approach has advantages over communication-avoiding methods, tile methods [5], and their combination [8]:
• no additional computational cost;
• no additional memory consumption.

The panel factorization approach has the same computational cost and memory usage as the classic LAPACK algorithms [4], which makes it preferable for systems with coprocessors. The implementation preserves the standard LAPACK interfaces and data layout, and the algorithm can be applied to any matrices. The implementation of our method is DAG-based [6] and uses panel factorization kernels that were redesigned and rewritten for the new Intel Xeon Phi products [7, 9]. Figure 1 shows the algorithm represented as a DAG.

Figure 1: Algorithm represented as a DAG
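To make the panel-factorize / trailing-update structure that the DAG in Figure 1 encodes concrete, here is a minimal sequential sketch of a blocked (right-looking) Cholesky factorization in plain NumPy. This is an illustration of the general blocked structure, not Intel MKL's implementation; the function name and block size are ours. Each "panel" task must complete before the "update" tasks that read it, which is exactly the dependency structure the DAG captures:

```python
import numpy as np

def blocked_cholesky(A, nb=2):
    """Right-looking blocked Cholesky (lower triangular), illustrating the
    panel-factorize / trailing-update stages of the DAG. In the paper's
    scheme, panel stages run on the CPU while the large trailing updates
    are candidates for offload to coprocessors."""
    A = A.copy()
    n = A.shape[0]
    for k in range(0, n, nb):
        e = min(k + nb, n)
        # --- panel factorization stage: factor the diagonal block ---
        A[k:e, k:e] = np.linalg.cholesky(A[k:e, k:e])
        if e < n:
            # Triangular solve forming the rest of the panel column.
            A[e:, k:e] = A[e:, k:e] @ np.linalg.inv(A[k:e, k:e]).T
            # --- update stage: rank-nb update of the trailing submatrix ---
            A[e:, e:] -= A[e:, k:e] @ A[e:, k:e].T
    return np.tril(A)
```

The update stage dominates the flop count and exposes abundant data parallelism, which is why it is the natural part to run on the coprocessors.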
The algorithm implementation has the following features:
• At the beginning, CPUs produce a number of factorized panels and send to the coprocessors as many panels as needed to maximize coprocessor utilization.
• Coprocessors perform "update" stages in parallel.
• CPUs perform both "factorization" stages and "update" stages in parallel.
• To achieve the best load balance, a coprocessor may send a panel back to the CPU side at any processing stage.

The proposed method provides a high degree of parallelism while minimizing synchronization and communication. The algorithm enables adaptive workload distribution between CPUs and coprocessors to improve load balancing, namely:
• Adaptive data/task distribution on the fly between CPUs and coprocessors.
• No limit on the number of coprocessors in heterogeneous systems.
• Scalability: a system with CPUs and one coprocessor shows a 3x performance improvement, and a system with CPUs and two coprocessors shows a 5x performance improvement.
• No algorithmic limitations on matrix sizes.

Our algorithm is implemented within the framework of the Intel MKL [3] LU, QR, and Cholesky factorization routines. The implemented routines detect the presence of Intel Xeon Phi coprocessors and automatically offload the computations that benefit from the additional computational resources. This usage model hides the complexity of heterogeneous systems from the user, providing ease of use and the same API as the usual Intel MKL routines. This parallelization method can be effectively applied to other LAPACK [4] algorithms.

1. REFERENCES
[1] Netlib. A collection of mathematical software, papers, and databases. http://www.netlib.org.
[2] Jakub Kurzak and Jack Dongarra. Implementing Linear Algebra Routines on Multi-Core Processors with Pipelining and a Look Ahead. UT-CS-06-581, September 2006. LAPACK Working Note #178.
[3] Intel® Math Kernel Library. http://www.intel.com/software/products/mkl.
[4] LAPACK: Linear Algebra PACKage. http://www.netlib.org/lapack.
[5] Sergey V. Kuznetsov. An approach of the QR factorization for tall-and-skinny matrices on multicore platforms. In Proceedings of PARA 2012: 11th International Conference, Helsinki, Finland, LNCS, vol. 7782, Springer Verlag, pp. 235-249, 2013.
[6] A. Kobotov and S. V. Kuznetsov. Efficient dynamic parallelization for the QR factorization. In Proceedings of PARA 2008: 9th International Workshop on State-of-the-Art in Scientific and Parallel Computing, 2008.
[7] Alexander Heinecke, Karthikeyan Vaidyanathan, Mikhail Smelyanskiy, Alexander V. Kobotov, Roman S. Dubtsov, Greg Henry, Aniruddha G. Shet, George Chrysos, and Pradeep Dubey. Design and Implementation of the Linpack Benchmark for Single and Multi-Node Systems Based on Intel® Xeon Phi™ Coprocessor. IEEE International Parallel & Distributed Processing Symposium (IPDPS), 2013.
[8] E. Agullo, C. Augonnet, J. Dongarra, M. Faverge, H. Ltaief, S. Thibault, and S. Tomov. QR Factorization on a Multicore Node Enhanced with Multiple GPU Accelerators. IPDPS 2011.
[9] Michael Deisher, Mikhail Smelyanskiy, Brian Nickerson, Victor W. Lee, Michael Chuvelev, and Pradeep Dubey. Designing and Dynamically Load Balancing Hybrid LU for Multi/Many-core. International Supercomputing Conference, 2011.
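As a toy illustration of the adaptive on-the-fly task distribution described in the summary (this is not Intel MKL's actual scheduler; the device names and per-task costs below are made up), an event-driven sketch can show how "update" tasks naturally flow to whichever device frees up first:

```python
import heapq

def simulate_offload(n_panels, devices):
    """Toy event-driven simulation of adaptive work distribution: at each
    panel step, every trailing 'update' task is handed to whichever device
    (CPU or coprocessor) becomes free earliest. `devices` maps a device
    name to its hypothetical per-update cost. Returns per-device counts."""
    free_at = [(0.0, name) for name in devices]   # (time device is free, name)
    heapq.heapify(free_at)
    counts = {name: 0 for name in devices}
    for k in range(n_panels):
        # One update task per panel remaining after this one.
        for _ in range(n_panels - k - 1):
            t, name = heapq.heappop(free_at)      # earliest-free device wins
            counts[name] += 1
            heapq.heappush(free_at, (t + devices[name], name))
    return counts
```

For example, `simulate_offload(8, {"cpu": 1.0, "phi0": 0.4, "phi1": 0.4})` lets the two faster (hypothetical) coprocessors absorb proportionally more of the 28 update tasks, with no static split decided in advance, which is the essence of the adaptive load balancing above.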